ABSTRACT


Data mining of course enrollment and course description 
records has soared as institutions of higher education be- 
gin tapping into the value of these data for academic and 
internal research purposes. This has led to a more than 
doubling of papers on course prediction tasks every year. 
The papers often center around a single prediction task and 
introduce a single novel modeling approach utilizing one or 
two data sources. In this paper, we provide the most com- 
prehensive evaluation to date of data sources, models, and 
their performance on downstream prediction tasks. We sep- 
arately incorporate syllabus, catalog description, and enroll- 
ment history data to represent courses using graph embed- 
ding, course2vec (i.e., skip-gram), and classic bag-of-words 
models. We evaluate these representations on the tasks 
of predicting course prerequisites, credit equivalencies, stu- 
dent next semester enrollments, and student course grades. 
Most notably, our results show that syllabi bag-of-words 
representations performed better than course descriptions 
in predicting prerequisite relationships, though enrollment- 
based graph embeddings performed substantially better still. 
Course descriptions provided the highest single representa- 
tion accuracy in predicting course similarity, with descrip- 
tions, syllabi, and course2vec combined representations pro- 
viding the highest ensembled accuracy on this task. 
1. INTRODUCTION


Data from institutions of higher education are quickly coming
into focus for the educational data mining and learning
analytics communities as the utility of these data starts to
become clear and attention begins to shift from the informal
learning context of free online courses to the higher stakes
context of degree granting institutions and their students.
Educational Data Mining (EDM) plays an important role
in the developing stages of methodological adaptation to a
domain by evaluating new sources of data for their utility
in existing models and tasks and updating the utility of
existing data as models and tasks evolve. Recently, EDM has
seen the number of papers focused on prediction with large
institutional enrollment datasets from the formal higher
education context more than double year over year, with a
single paper on the topic in 2017 [38], two in 2018 [12, 6],
and five in 2019 [29, 36, 19, 37, 16], though early pioneering
work on predicting academic outcomes dates back to the first
EDM conference [39, 2].


In this paper, we summarize and evaluate this quickly de- 
veloping domain across three dimensions: sources of insti- 
tutional data, models for representing students and courses, 
and the performance of the former two categories on institu- 
tionally relevant prediction tasks. As academic researchers 
and practitioners know, not all sources of data are always 
available and different costs are associated with obtaining a 
new source. Similarly, when it comes to modeling, different 
personnel and computational costs are associated with ap- 
plying models depending on their complexity and recency of 
introduction. We provide the most comprehensive evalua- 
tion to date of the performance of different combinations of 
data and models on common institutional tasks emerging in 
the literature so that the costs and benefits of each, in our 
setting, can be quickly appraised. In addition to evaluating
previously introduced approaches and data, we introduce 
large scale syllabus data as a novel source of information 
about courses and a novel application of a nascent graph- 
embedding approach for representing courses. 


2. RELATED WORK 


Contemporary approaches to data mining institutional data- 
sets in higher education have distinguished themselves from 
earlier drop-out detection work [18] in the use of enrollment
data and adoption of representational methods that factorize,
embed, or otherwise vectorize courses into a space.
This began with [10], which applied matrix factorization
to student enrollments and observed that the factorization
grouped courses and students in semantically meaningful 
ways. Subsequent research also employed matrix factor- 
ization for grade prediction tasks [38, 37]. Neural embed- 
ding models followed, with the skip-gram neural network 
model applied to sequences of course enrollments, an ap- 
proach coined “Course2vec” [32]. The course embeddings 
extracted from this model were found to be predictive of
on-time graduation [25], course similarity within [33] and across
institutions [31], and of latent topics of courses [8]. Student
course selections have also been posed as a graph, treating
courses as nodes and strengthening the edge between two
courses the more students they share in common [15, 16, 1].
The aforementioned approaches all use student course
selections, a collaborative signal, to represent a course.
Other approaches utilize content data of a course (e.g.,
catalog description) for representation and for downstream
tasks such as course similarity analysis [26, 31, 33, 12, 29]
and enrollment prediction [32]. Several papers have collected
course ratings for modeling and recommendation [13, 12].

Table 1: Related work on institutional prediction tasks (columns) and sources of data used in the task (rows)

Data source          | Grade prediction         | Enrollment prediction | Prerequisite prediction | Course similarity
Course grades        | [10, 15, 21, 38, 37, 16] | [10, 3, 32, 36]       | [22, 11, 21, 16]        | [17, 25, 12, 29]
Enrollment histories | [10, 15, 21, 38, 37, 16] | [10, 3, 32, 36, 1]    | [22, 11, 21, 16]        | [17, 25, 31, 33, 29]
Major declarations   | [38, 21, 37]             | [32, 36]              | [21]                    |
Catalog descriptions |                          | [32]                  |                         | [26, 31, 33, 12, 29]


The majority of models in related work have been framed
as potentially contributing to a course recommendation
system, or as already integrated into one. They commonly focused
on grade prediction [10, 15, 21, 38, 37, 16] as a necessary
first step towards a preparation-based or goal-based [21]
recommendation system that could aid students in preparing for
difficult courses. In a similar vein, prerequisite course
inference has also been framed [22, 11, 21, 16] as a potential
means to help guide students towards course-taking paths
expected to be more successful than others [11, 30]. Table
1 summarizes this body of work in terms of the most com- 
mon data sources used (i.e., course grades, enrollment histo- 
ries, major declarations, and catalog descriptions) and most 
common evaluation tasks (i.e., grade prediction, enrollment 
prediction, prerequisite prediction, and course similarity) fo- 
cused on in this paper. 


3. DATA SOURCES 


In this section, we will describe the three primary sources of 
data utilized in this paper. First, we will describe the source 
generally, followed by a paragraph detailing the particulars 
of the dataset used in our offline evaluation experiments. 


3.1 Enrollment histories and grades 

A student’s transcript is classically a report containing the 
student’s histories of courses taken and the grade achieved 
in each. Enterprise database systems often store raw forms 
of these data. It has become more common for institutions 
to not only store these data in relational form but for their 
internal offices of institutional analytics to have ready access 
to them. As the fields of EDM and learning analytics have 
grown, these data have become more available to faculty to 
aid scholarly research. We used an anonymised enrollments 
and grades dataset containing student enrollment histories 
at a large public university, UC Berkeley, collected from Fall 
2008 through Fall 2017. The dataset consists of per-semester 
(i.e., Fall, Spring, and Summer) class enrollments for 164,196
students (both undergraduates and graduates) with a total 
of 4.8 million class enrollments. A class enrollment record 
in the data indicates that the student was still enrolled in 
the class at the end of the semester. The action of dropping
a class is not contained in these data. The median
number of classes a student enrolled in per semester was
four. There were 9,478 unique lecture courses from 214 de- 
partments hosted in 17 different Divisions of 6 different Col- 
leges. Course meta-information was also included in these 
data and contained course number, department name, class 
instructor(s), and room max capacity. In this paper, we only 
consider lecture courses with at least 20 enrollments total 
over the 9-year period, resulting in 7,487 courses. Although 
courses can be categorized as undergraduate courses and 
graduate courses, undergraduates are allowed to enroll in 
many of the graduate courses. Enrollment data were sourced 
from the campus’ enterprise data warehouse. 


3.2 Course catalog descriptions 

A paper catalog used to be the primary way in which students
could browse all the course offerings at an institution.
Fortunately, this has been superseded by online catalogs, most
of which are searchable. The catalog contains course numbers,
their hosting department, and typically a paragraph
describing the course. Our dataset contains the
most recent catalog description of every course in our en- 
rollment histories. The average catalog description length 
was 325 words with 489 courses having exceptionally short 
descriptions of 10 words or fewer. We sourced these descrip- 
tions from the campus Office of the Registrar official API for 
Course information. These descriptions were pre-processed 
by (1) removing generic sentences that recur across
descriptions, (2) removing stop words, (3) removing punctuation,
and (4) lemmatizing and stemming words.
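
The four pre-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the stop-word list, the set of generic sentences, and the suffix-stripping `crude_stem` helper are all hypothetical stand-ins for a full stop-word list and a proper lemmatizer and stemmer.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny resources (assumptions for illustration).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "this", "will"}
GENERIC_SENTENCES = {"prerequisites: consent of instructor."}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes (a stand-in for real stemming)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(description: str) -> List[str]:
    # (1) Drop generic sentences that recur across many descriptions.
    sentences = re.split(r"(?<=[.!?])\s+", description.strip())
    kept = [s for s in sentences if s.lower() not in GENERIC_SENTENCES]
    text = " ".join(kept).lower()
    # (3) Remove punctuation (done first so that tokens are clean).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # (2) Remove stop words, then (4) normalize word forms.
    return [crude_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


tokens = preprocess(
    "This course covers sorting algorithms and graphs. "
    "Prerequisites: consent of instructor."
)
```

The resulting token list is the input to the bag-of-words representation described in Section 4.1.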


3.3 Course syllabi from the Learning Management System

A course syllabus is a detailed, chronological list of subjects 
and assignments that a course will cover, often with other 
logistical information about course meeting place and time 
and grading policies. While the syllabus is perhaps an ideal 
source of information to utilize for content-based represen- 
tation of a course, it has been an elusive source to conduct 
research on. This is because few institutions mandate that 
instructors make their syllabi public and therefore it is un- 
common to have syllabi centrally stored by the institution to 
subsequently make available to researchers. An additional 
barrier to research availability is that many institutions view 
a syllabus as an instructor’s intellectual property (IP), and 
therefore not sharable in original form without permission. 
Our study introduces syllabus data into contemporary pre- 
dictive models and tasks, but with a caveat that maintains 
instructor control over the original intellectual property. 


The university from which our syllabus data come considers
syllabi to be instructor IP and does not collect them
centrally. However, instructors often place their syllabi
on the “Syllabus” page of the campus
Learning Management System (LMS). We worked with
the campus technology services organization in charge of 
the LMS to extract all text from the Syllabus pages of all 
courses. Sometimes this page would contain only a link to
a PDF of the syllabus, in which case the linked file was
downloaded and parsed to text. To abide by the IP restrictions
around course syllabi and respect instructor ownership of
them, a workaround was arranged. Only the technology
services organization would have access to the cleanly parsed
data from the LMS. They would then pre-process the syllabi
themselves, similar to how we pre-processed catalog
descriptions, stripping out HTML and converting the text into
bag-of-words (BOW) form.
This form would thereby make the syllabus unusable as an 
instructional object but potentially usable by an algorithm 
attempting to extract information for institutional predic- 
tion tasks. It was also agreed that the BOW we received 
would not be made public and these data could be revoked 
at any time. There were 3,645 unique courses that contained 
HTML on the LMS Syllabus page, not including a link to 
a file. There were 2,712 courses that contained a link to a 
file, with some courses having both. The total number of 
courses with some amount of syllabus data was 4,017 with 
a combined vocabulary of 17,194 unique words. 


4. REPRESENTATION MODELS 


We choose four approaches of increasing complexity for
representing courses. These four reflect the most common paradigms
of modeling found in our literature review. The simplest is
a content-based bag-of-words representation of the course. 
The BOW approach could be applied to the catalog descrip- 
tion or syllabus of a course, where available. Next is the use 
of a recently published variant on Course2vec called multi- 
factor Course2vec, which applies a skip-gram to sequences of 
course enrollments. In addition to embedding courses, mul- 
tifactor Course2vec also embeds the instructor of the course 
and the course’s department, both presented to the model 
in the form of a one-hot encoding. Multifactor Course2vec 
has been shown to perform better on course similarity tasks 
than the original Course2vec [33], in theory because it sep- 
arates out factors, such as instructor and department, al- 
lowing the course embedding to more purely represent the 
content. Long Short-Term Memory models are the third 
model used to embed courses, followed by a recently intro- 
duced network embedding technique. 


A summary of the approaches used is visually illustrated in 
Figure 2. The various types of information these methods 
leveraged are summarized in Table 2. 


Table 2: Summary of representation learning methods for courses

bag-of-words | static
multi-c2v    | dynamic
LSTM         | dynamic
sc-AMHEN     | static


4.1 Bag-of-words 

The basic bag-of-words representation model was proposed
by information retrieval researchers for text corpora. It is a
model that reduces each document in a corpus to a vector of
real numbers, each of which is the weight of one term in the
vocabulary. The term weight can be the term frequency, a binary
value with 1 indicating that the term occurred in the document
and 0 indicating that it did not, or a tf-idf weight [7].
There are two sources of text that can represent the content
of courses: the course catalog descriptions and course
syllabi.
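
As an illustration of the tf-idf variant, the sketch below builds sparse bag-of-words vectors for a toy corpus and compares courses by cosine similarity. The course names and tokens are invented, and the unsmoothed weighting `tf * log(N / df)` is one common choice assumed here; the exact weighting scheme used in the paper is not specified beyond the citation.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each course to a sparse {term: tf * log(N / df)} vector."""
    n_docs = len(docs)
    # Document frequency: number of courses whose text contains the term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for course, tokens in docs.items():
        counts = Counter(tokens)
        vectors[course] = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pre-processed course descriptions.
docs = {
    "STAT 101": ["probability", "statistics", "inference"],
    "STAT 102": ["probability", "statistics", "regression"],
    "HIST 10": ["ancient", "history", "rome"],
}
vecs = tfidf_vectors(docs)
```

In this toy corpus the two statistics courses share weighted terms and therefore have a positive cosine similarity, while the history course shares none.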


4.2 Multifactor Course2vec 

The Course2vec model [32] was proposed to learn distributed 
representations of courses from students’ enrollment records 
throughout semesters by using a notion of an enrollment se- 
quence as a “sentence” and courses within the sequence as 
“words”, borrowing terminology from the natural language 
domain. For each student, their chronological course enroll- 
ment sequence is produced by first sorting by semester then 
randomly serializing within-semester course order. Each 
course enrollment sequence is then trained on like a sentence 
using a skip-gram model. 
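
The sequence construction described above can be sketched as follows: sort by semester, shuffle within a semester, then enumerate (center, context) pairs within a window as a skip-gram model would consume them. The record format, function names, and course identifiers are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def enrollment_sequence(records: List[Tuple[int, str]], rng: random.Random) -> List[str]:
    """Build one student's course 'sentence' from (semester_index, course) records."""
    by_semester = defaultdict(list)
    for semester, course in records:
        by_semester[semester].append(course)
    sequence = []
    for semester in sorted(by_semester):      # chronological order across semesters
        courses = list(by_semester[semester])
        rng.shuffle(courses)                  # random order within a semester
        sequence.extend(courses)
    return sequence


def skipgram_pairs(sequence: List[str], window: int) -> List[Tuple[str, str]]:
    """Enumerate (center, context) pairs for offsets -window..window, excluding 0."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical student record: two courses in each of two semesters.
records = [(2, "CS 61B"), (1, "CS 61A"), (1, "MATH 1A"), (2, "MATH 1B")]
seq = enrollment_sequence(records, random.Random(0))
pairs = skipgram_pairs(seq, window=1)
```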


More features of courses (e.g., course instructor and de- 
partment) can be added to the input of the multifactor 
Course2vec model to enhance the classifier and its repre- 
sentations. The model learns both course and added feature 
representations by maximizing the objective function over 
all the students’ enrollment sequences and the features of 
courses, defined as follows. 


$$\sum_{s \in S} \sum_{c_i \in s} \sum_{-w \le j \le w,\, j \ne 0} \log p(c_{i+j} \mid c_i, f_{i1}, f_{i2}, \ldots, f_{ih}) \tag{1}$$

The probability $p(c_{i+j} \mid c_i, f_{i1}, f_{i2}, \ldots, f_{ih})$ of observing a
neighboring course $c_{i+j}$ in window size $w$ given the current course
$c_i$ and its features $f_{i1}, f_{i2}, \ldots, f_{ih}$ (e.g., instructors,
department) can also be defined via the softmax function,

$$p(c_{i+j} \mid c_i, f_{i1}, f_{i2}, \ldots, f_{ih}) = \frac{\exp(a_i^{\top} v'_{i+j})}{\sum_{k=1}^{n} \exp(a_i^{\top} v'_k)} \tag{2}$$

$$a_i = v_i + \sum_{j=1}^{h} W_{n_j \times v} f_{ij} \tag{3}$$

where $a_i$ is the vector sum of the input course vector
representation $v_i$ and all the feature vector representations of course
$c_i$, $f_{ij}$ is the multi-hot input of the $j$-th feature of course $i$,
and $W_{n_j \times v}$ is the weight matrix for feature $j$. Multiplying
$W_{n_j \times v}$ by $f_{ij}$ therefore yields the sum of the feature vector
representations of the $i$-th course. The illustration of the
model is shown in the multi-course part of Figure 2. $v_i$ is
the course representation of course $i$ learned from the model
that is used in various down-stream course prediction tasks.
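
A small numeric sketch of Equations (2) and (3) follows. It assumes each feature weight matrix is stored with one row per feature value (shape $n_j \times v$), so that multiplying by a multi-hot vector amounts to summing the rows of the active values. All vectors and matrices below are made-up toy values.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def combine_input(v_i: Vector, feature_weights: List[Matrix], features: List[Vector]) -> Vector:
    """Eq. (3): a_i = v_i plus, for each feature j, the rows of W_j selected by f_ij."""
    a_i = list(v_i)
    for W, f in zip(feature_weights, features):
        for row, active in zip(W, f):
            if active:
                for d, w in enumerate(row):
                    a_i[d] += active * w
    return a_i


def softmax_over_courses(a_i: Vector, output_vectors: Matrix) -> Vector:
    """Eq. (2): softmax of dot products between a_i and every output course vector."""
    scores = [sum(a * u for a, u in zip(a_i, out)) for out in output_vectors]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


v_i = [0.5, -0.5]                                     # toy course vector, dimension 2
W_instructor = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]   # 3 hypothetical instructors
W_department = [[1.0, 0.0], [0.0, 1.0]]               # 2 hypothetical departments
f_instructor = [1.0, 0.0, 1.0]                        # co-taught by instructors 0 and 2
f_department = [0.0, 1.0]

a_i = combine_input(v_i, [W_instructor, W_department], [f_instructor, f_department])
probs = softmax_over_courses(a_i, [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
```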


4.3 LSTM-learned Representations 

In previous work [32], an LSTM was designed to recommend
courses for students to take in the next semester, based on
their enrollment histories. The input of the model in each
time slice is a multi-hot vector representing the courses taken
in the corresponding semester. The input weights
$W_f$, $W_i$, $W_o$, and $W_c$ learned by the LSTM map the
multi-hot input to the forget gate, input gate, output gate,
and the cell in the LSTM cell, respectively. These four sets
of weights are combined to form representations of courses
that can be used in down-stream prediction tasks.

Figure 1: Illustration of the Attributed Multiplex
HEterogeneous Network (AMHEN) of Students and
Courses.


4.4 Attributed Multiplex Heterogeneous Network Embeddings


Network representation learning (i.e., network embedding)
is a promising method to project nodes in a network onto a
low-dimensional continuous space while preserving network 
structure and inherent properties. In terms of the network 
topology (homogeneous or heterogeneous) and attributed 
property (with or without attributes), six different types of 
networks can be categorized, i.e., HOmogeneous Network 
(HON) [34], Attributed HOmogeneous Network (AHON) 
[40], HEterogeneous Network (HEN) [9], Attributed HEt- 
erogeneous Network (AHEN) [5], Multiplex HEterogeneous 
Network (MHEN) [24], and Attributed Multiplex HEtero- 
geneous Network (AMHEN) [4]. In the university setting, 
students and courses can be mapped into a large heteroge- 
neous network, where students and courses are two types of 
nodes connected by students’ enrollments in courses. The
proximities between students and courses vary based on the
grades (e.g., A, B, C, D, etc.) students received for courses,
yielding a network with multiple views, i.e., a multiplex
heterogeneous network. Furthermore, if we incorporate the
attributes of student and course nodes (e.g., course catalog
descriptions), the network becomes an Attributed Multiplex
HEterogeneous Network (AMHEN), which is illustrated in
Figure 1. Because students may receive different grades for
the courses they enrolled in, we consider different grades as
different edge types between students and courses.
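
To make the multiplex structure concrete, the sketch below groups enrollment records into one edge set (view) per grade type. The record layout and the identifiers `s1`, `c1`, and so on are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (student, course)


def build_views(enrollments: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[Edge]]:
    """Group (student, course, grade) records into one edge set per grade type."""
    views = defaultdict(set)
    for student, course, grade in enrollments:
        views[grade].add((student, course))
    return dict(views)


def course_neighbors(view: Set[Edge], course: str) -> Set[str]:
    """Students connected to `course` in a single view (grade type)."""
    return {s for s, c in view if c == course}


enrollments = [
    ("s1", "c1", "A"),
    ("s2", "c1", "A"),
    ("s2", "c2", "B"),
    ("s3", "c1", "B"),
]
views = build_views(enrollments)
```

Here course `c1` has different student neighbors in the "A" view than in the "B" view, which is what allows a separate embedding per grade type.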


DEFINITION 1. (Attributed Multiplex Heterogeneous Network):
An attributed multiplex heterogeneous network is a
network $G = (\mathcal{V}, \mathcal{E}, \mathcal{A})$, $\mathcal{E} = \bigcup_{r \in \mathcal{R}} \mathcal{E}_r$, where $\mathcal{E}_r$ consists of
all edges with edge type $r \in \mathcal{R}$, and $|\mathcal{R}| > 1$. We separate
the network for every edge type $r \in \mathcal{R}$ as $G_r = (\mathcal{V}, \mathcal{E}_r, \mathcal{A})$.
Each node $v_i \in \mathcal{V}$ is associated with some types of feature
vectors. $\mathcal{A} = \{x_i \mid v_i \in \mathcal{V}\}$ is the set of node features for all
nodes, where $x_i$ is the associated node feature of node $v_i$.


In the student-course attributed multiplex heterogeneous
network we described above, $\mathcal{V} = (\mathcal{C}, \mathcal{S})$, where each node
$c \in \mathcal{C}$ represents a course in the course set $\mathcal{C}$ and each node
$s \in \mathcal{S}$ represents a student in the student set $\mathcal{S}$. $\mathcal{R}$ refers
to all the edge types in the student-course attributed
multiplex heterogeneous network, i.e., grade types. As students
have enrollment and grade histories of multiple courses, we
consider student embeddings as a state of their course
knowledge. Different grade types mirror different levels of course
knowledge, and thus should be represented as different
embeddings.


Given the above definitions and descriptions, we can for- 
mally define our problem for representation learning on the 
student-course AMHEN. 


PROBLEM 1. (Student-Course AMHEN Embedding). Given
a Student-Course AMHEN $G = (\mathcal{C}, \mathcal{S}, \mathcal{E}, \mathcal{A})$, the problem of
Student-Course AMHEN embedding is to give a unified
low-dimensional space representation of each student node $s \in \mathcal{S}$
and each course node $c \in \mathcal{C}$ on every grade type $r$. The goal
is to find a function $g : \mathcal{S} \to \mathbb{R}^d$ and a function $f_r : \mathcal{C} \to \mathbb{R}^d$
for every grade (edge) type $r$, where $d \ll |\mathcal{C}|$ ($d \ll |\mathcal{S}|$).


4.4.1 Student and Course Representations 

In this section, we detail our adaptation of the AMHEN
framework [4] to the student-course scenario to learn
graph-based student and course representations. We split the
overall course embedding on each grade type $r$ into three parts:
base embedding $b_c$, grade embedding $g$, and attribute
embedding $u$, and split the overall student embedding into two
parts: base embedding $b_s$ and individual embedding $p$.


The base embedding of course node $c_i$, i.e., $b_{c_i}$, is shared
between different grade types. We define $b_{c_i}$ as a
parameterized function of $c_i$'s attributes $x_i$ as

$$b_{c_i} = f(x_i) \tag{4}$$

where $f$ is a transformation function, such as a multi-layer
perceptron. The attribute embedding of course node $c_i$, i.e.,
$u_i$, is defined as:

$$u_i = D^{\top} x_i \tag{5}$$


Given that in the Student-Course AMHEN, the neighbors of
a course are all students while the neighbors of students are
all courses, the $k$-th level¹ of grade embedding $g_{ir}^{(k)} \in \mathbb{R}^d$
$(1 \le k \le K)$ of course node $c_i$ on grade type $r$ is aggregated
from the individual embeddings of the students that are $c_i$'s
neighbors, i.e., the students who all received grade type $r$
for course $c_i$.

$$g_{ir}^{(k)} = \mathrm{mean}(\{p_j^{(k-1)}, \forall s_j \in \mathcal{N}_{ir}\}) \tag{6}$$

Similarly, the $k$-th level of individual embedding $p_i^{(k)} \in \mathbb{R}^d$
$(1 \le k \le K)$ of a student node $s_i$ is aggregated from the grade
embeddings of the courses that are $s_i$'s neighbors, which means
that a student’s representation is derived from the grade
histories of their enrolled courses.

$$p_i^{(k)} = \mathrm{mean}(\{g_{jr}^{(k-1)}, \forall c_j \in \mathcal{N}_{ir}\}) \tag{7}$$
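
The mean aggregation over neighbors in one grade view can be sketched as below. This is an illustrative reading of Equations (6) and (7) with plain lists standing in for embeddings; the edge list and initial student vectors are toy values, and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def update_grade_embeddings(view_edges: List[Tuple[str, str]],
                            student_emb: Dict[str, Vector]) -> Dict[str, Vector]:
    """Eq. (6): a course's grade embedding is the mean over its student neighbors."""
    by_course = defaultdict(list)
    for student, course in view_edges:
        by_course[course].append(student_emb[student])
    return {course: mean_vectors(vs) for course, vs in by_course.items()}


def update_student_embeddings(view_edges: List[Tuple[str, str]],
                              grade_emb: Dict[str, Vector]) -> Dict[str, Vector]:
    """Eq. (7): a student's individual embedding is the mean over course neighbors."""
    by_student = defaultdict(list)
    for student, course in view_edges:
        by_student[student].append(grade_emb[course])
    return {student: mean_vectors(vs) for student, vs in by_student.items()}


edges_A = [("s1", "c1"), ("s2", "c1"), ("s2", "c2")]   # toy edges in the "A" view
p_prev = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}          # level k-1 student embeddings
g_next = update_grade_embeddings(edges_A, p_prev)      # next-level grade embeddings
p_next = update_student_embeddings(edges_A, g_next)    # following-level student embeddings
```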


¹By level we mean iteration, i.e., the embedding is updated
after each parameter update process.

Figure 2: Visual summary of representation learning methods


We denote the $K$-th level grade embedding $g_{ir}^{(K)}$ as grade
embedding $g_{ir}$, and concatenate all the grade embeddings
for course node $c_i$ as $G_i \in \mathbb{R}^{d \times m}$, where $d$ is the dimension
of grade embeddings and $m$ is the number of grade types.

$$G_i = (g_{i1}, g_{i2}, \ldots, g_{im}) \tag{8}$$

We use a self-attention mechanism [23] to compute the
coefficients $a_{ir} \in \mathbb{R}^m$ of the linear combination of vectors in $G_i$ on
edge type $r$ as:

$$a_{ir} = \mathrm{softmax}(w_r^{\top} \tanh(W_r G_i))^{\top} \tag{9}$$

where $w_r$ and $W_r$ are trainable parameters
for grade type $r$. Thus, the overall embedding of course node
$c_i$ for grade type $r$ is:

$$c_{ir} = \alpha_c f(x_i) + M_r^{\top} G_i a_{ir} + \beta_c D^{\top} x_i \tag{10}$$

where $M_r$ and $D$ are trainable transformation matrices, and
$\alpha_c$ and $\beta_c$ are two coefficients adjusting the
weights of the three embeddings of courses, which can also
be trainable.

The overall embedding of student node $s_i$ is:

$$s_i = \alpha_s b_{s_i} + N^{\top} p_i \tag{11}$$

where $\alpha_s$ is a trainable coefficient adjusting the weights of
the two embeddings of students, and $N$ is a trainable
transformation matrix for the individual embeddings of
students.


4.4.2 Model Optimization 


Having the student and course representations constructed,
we discuss how to generate the training data and learn the
student and course embeddings. We first separate the whole
network by edge (grade) type. Then, given a view (grade type)
$r$ of the network, i.e., $G_r = (\mathcal{C}, \mathcal{S}, \mathcal{E}_r, \mathcal{A})$, we use
meta-path-based random walks [9] to generate node sequences. There are
two meta-path schemas in the student-course AMHEN, i.e.,
student → course → student and course → student → course.
Finally, we apply a skip-gram [27, 28] over the node
sequences to learn embeddings. The meta-path-based random
walk strategy ensures that the semantic relationships
between student nodes and course nodes with different grade
types can be properly incorporated into the skip-gram model
[9]. For a training pair $(c_i, s_j)$ with grade type $r$, our
objective is to maximize the probability:

$$P(s_j \mid c_i, r) = \frac{\exp(c_{ir}^{\top} s'_j)}{\sum_{s_t \in \mathcal{S}} \exp(c_{ir}^{\top} s'_t)} \tag{12}$$

where $s'_j$ is the context embedding of student node $s_j$. For
a training pair $(s_i, c_j)$ with grade type $r$, our objective is to
maximize the probability:

$$P(c_j \mid s_i, r) = \frac{\exp(s_i^{\top} c'_{jr})}{\sum_{c_t \in \mathcal{C}} \exp(s_i^{\top} c'_{tr})} \tag{13}$$

where $c'_{jr}$ is the context embedding of course node $c_j$ with
grade type $r$. Finally, we use heterogeneous negative
sampling to approximate the objective function $-\log P(s_j \mid c_i, r)$
for node pair $(c_i, s_j)$ as

$$\mathrm{loss}(c_i, s_j, r) = -\log \sigma(c_{ir}^{\top} s'_j) - \sum_{l=1}^{L} \mathbb{E}_{s_k \sim P(s_k)} [\log \sigma(-c_{ir}^{\top} s'_k)] \tag{14}$$

and the objective function $-\log P(c_j \mid s_i, r)$ for node pair $(s_i, c_j)$
as:

$$\mathrm{loss}(s_i, c_j, r) = -\log \sigma(s_i^{\top} c'_{jr}) - \sum_{l=1}^{L} \mathbb{E}_{c_k \sim P(c_k)} [\log \sigma(-s_i^{\top} c'_{kr})] \tag{15}$$

Here we define $P(s_k) = \frac{f(s_k)^{3/4}}{\sum_{i=1}^{|\mathcal{S}|} f(s_i)^{3/4}}$ and $P(c_k) = \frac{f(c_k)^{3/4}}{\sum_{i=1}^{|\mathcal{C}|} f(c_i)^{3/4}}$
according to the Skip-gram model [27], where $f$ refers to the
frequency of the node in each node type.
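
The frequency-based sampling distribution with the 3/4 exponent can be sketched as follows. The node frequencies are toy values, and excluding the positive node from the negative samples is a design choice assumed for this example rather than something stated in the text.

```python
import random
from typing import Dict, List


def unigram_power_distribution(freq: Dict[str, int], power: float = 0.75) -> Dict[str, float]:
    """P(node) proportional to f(node)^(3/4), computed within one node type."""
    weights = {node: count ** power for node, count in freq.items()}
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


def sample_negatives(dist: Dict[str, float], positive: str, n_samples: int,
                     rng: random.Random) -> List[str]:
    """Draw n_samples negatives of the same node type, skipping the positive node."""
    candidates = [node for node in dist if node != positive]
    weights = [dist[node] for node in candidates]
    return rng.choices(candidates, weights=weights, k=n_samples)


student_freq = {"s1": 16, "s2": 1, "s3": 1}   # toy node frequencies
P_s = unigram_power_distribution(student_freq)
negatives = sample_negatives(P_s, positive="s1", n_samples=5, rng=random.Random(0))
```

With these toy counts the exponent flattens the distribution: `s1` accounts for 16 of 18 raw occurrences but receives a sampling probability of about 0.8.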


After optimizing the model with all the parameters learned,
we form the overall embedding for course $i$ by concatenating
its embeddings of all grade types.

$$c_i = (c_{i1}, c_{i2}, \ldots, c_{im}) \tag{16}$$
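
The meta-path-based random walk used to generate node sequences within one grade view can be sketched as follows. Because every edge in a view connects a student to a course, a uniform random walk alternates node types automatically and therefore follows the student → course → student and course → student → course schemas. The edge list and the `s`/`c` identifier prefixes are hypothetical, and student and course identifiers are assumed to be distinct.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_adjacency(view_edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Undirected adjacency lists for the bipartite student-course graph of one view."""
    adj = defaultdict(list)
    for student, course in view_edges:
        adj[student].append(course)
        adj[course].append(student)
    return dict(adj)


def metapath_walk(adj: Dict[str, List[str]], start: str, length: int,
                  rng: random.Random) -> List[str]:
    """Walk up to `length` nodes, choosing a uniform random neighbor at each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


edges_A = [("s1", "c1"), ("s2", "c1"), ("s2", "c2")]   # toy edges in one grade view
adj = build_adjacency(edges_A)
walk = metapath_walk(adj, start="c1", length=5, rng=random.Random(0))
```

The resulting walks play the role of sentences for the skip-gram step, in the same way enrollment sequences do for Course2vec.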


5. TASKS 

In this section, we describe five down-stream institutionally 
relevant tasks that can be performed by using the course 
representations constructed by the model approaches intro- 
duced in Section 4. 


5.1 Course Similarity

An essential way to check the quality and fidelity of the
course representations introduced in Section 4 is to test whether
they contain important features of courses that could
differentiate between similar and dissimilar courses. To this
end, an equivalency validation set of 1,351 course
credit-equivalency pairs maintained by the Office of the Registrar
was used as similarity-based ground truth. A course is
paired with another course in this set if a student can only 
receive credit for taking one of the courses at the university. 
For example, an honors and non-honors version of the same 
course will appear as a pair because faculty have deemed 
that there is too much overlapping material between the 
two for a student to receive credit for both. 


To evaluate different course representations on the course 
equivalency validation set, we fixed the first course in each 
pair and ranked all the other courses according to their co- 
sine similarity to the first course in descending order. We 
then noted the rank of the expected second course in the pair 
and describe the performance of each model on all validation 
pairs in terms of Mean Rank, Median Rank and Recall@10. 
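
The ranking evaluation can be sketched as follows: for each validation pair, rank all other courses by cosine similarity to the first course and record the rank of the second. The embeddings and course names are toy values, and `k=1` is used only to keep the example small (the paper reports Recall@10).

```python
import math
from statistics import mean, median
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_partner(query: str, partner: str, embeddings: Dict[str, Vector]) -> int:
    """1-based rank of `partner` among all other courses, by descending cosine."""
    scored = [(cosine(embeddings[query], vec), course)
              for course, vec in embeddings.items() if course != query]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [course for _, course in scored].index(partner) + 1


def evaluate(pairs: List[Tuple[str, str]], embeddings: Dict[str, Vector], k: int = 10) -> dict:
    ranks = [rank_of_partner(a, b, embeddings) for a, b in pairs]
    return {
        "mean_rank": mean(ranks),
        "median_rank": median(ranks),
        "recall_at_k": sum(r <= k for r in ranks) / len(ranks),
    }


embeddings = {
    "MATH 1A": [1.0, 0.0],
    "MATH H1A": [0.9, 0.1],
    "HIST 7": [0.0, 1.0],
    "CS 10": [0.5, 0.5],
}
metrics = evaluate([("MATH 1A", "MATH H1A"), ("HIST 7", "MATH 1A")], embeddings, k=1)
```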


5.2 Enrollment Prediction 

Enrollment prediction involves predicting the courses a stu- 
dent will enroll in, but not the grade they will receive. For 
this reason, it is considered a model of behavior, rather than 
an assessment model. The task could be potentially useful 
for the purpose of providing a normative course taking signal 
that could be used to provide a personalized sorting of course 
results (e.g., showing the courses a student is most likely to 
take that satisfy a remaining requirement) [32]. The input 
of the model in each time slice is a multi-hot vector rep- 
resenting the courses taken in the corresponding semester. 
However, the multi-hot representation has a large dimension 
of total number of courses and may not encode course fea- 
tures apparent in text descriptions of the course or graph- 
based methods. Therefore, we also evaluate substituting 
the multi-hot course input with the sum of pre-trained
low-dimensional representations from other models, illustrated
in Figure 3. Performance on this task is reported in terms
of Recall@10 and Mean Reciprocal Rank@10 (MRR@10).
MRR evaluates recommender system models that produce
a list of ranked items for queries. The reciprocal rank is the
“multiplicative inverse” of the rank of the first correct item.
MRR is defined as $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$, where $|Q|$ is the
number of queries and $\mathrm{rank}_i$ represents the rank of the first
correct recommended item for query $i$. For calculating
MRR@10, the only difference is that the reciprocal rank
$1/\mathrm{rank}_i$ is set to 0 if $\mathrm{rank}_i > 10$.
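
A minimal sketch of MRR@k is given below. It takes the rank of the first correct item for each query, with `None` used here (an assumption of this example) for queries where no correct item was returned at all.

```python
from typing import List, Optional


def mrr_at_k(first_correct_ranks: List[Optional[int]], k: int = 10) -> float:
    """Mean reciprocal rank, counting 0 for ranks beyond k or missing ranks."""
    reciprocal = [
        1.0 / rank if rank is not None and rank <= k else 0.0
        for rank in first_correct_ranks
    ]
    return sum(reciprocal) / len(reciprocal)


# Four toy queries: ranks 1 and 2 count, rank 12 and a miss contribute 0.
score = mrr_at_k([1, 2, 12, None])
```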
Figure 3: Illustration of the LSTM-based next-course prediction


5.3 Grade Prediction 


Grade prediction is the basis for an assessment model that 
could aid adaptive sequencing of courses to achieve a
particular goal. In previous work [21], a modified LSTM was
designed to trace students’ course knowledge, which predicted
students’ grades on enrolled courses in each semester. The 
model gives students the ability to choose their grade goal (A 
or B) or Pass/No-pass. A masked loss function was designed 
to enable the output to predict letter grade and Pass/No- 
pass independently. Two cut-offs (A or B) were also set to 
separate the letter grades into two levels (e.g., higher and 
lower than an ’A’). The input of the LSTM grade predic- 
tion model is also a multi-hot vector with the position of 
grades students received for enrolled courses as 1 and other 
positions as 0. Because there are seven grade types for each 
course, the dimension of the model input in each time slice
is the number of courses multiplied by seven. As an alter- 
native to the multi-hot input, we also evaluate the perfor- 
mance of the model using the course grade representations 
learned from the student-course AMHEN model in Section 
4.4, which is illustrated in Figure 4, where $g_i$ represents the
grades of courses taken in semester $i$ and $c_i$ represents the
courses taken in semester $i$. $c_{i+1}$ is concatenated with $g_i$ to
incorporate the impact of the co-enrolling effect of courses
in the predicted semester on grade prediction.
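
The multi-hot grade input can be sketched as below. The course-major layout (seven consecutive positions per course) and the order of grade types are assumptions made for this example; the text only fixes the total dimension as the number of courses multiplied by seven.

```python
from typing import Dict, List

GRADE_TYPES = ["A", "B", "C", "D", "F", "Pass", "No-pass"]   # seven grade types


def grade_multi_hot(semester_grades: Dict[str, str], course_index: Dict[str, int]) -> List[int]:
    """One time slice: a 1 at (course, grade) for every course taken that semester."""
    vec = [0] * (len(course_index) * len(GRADE_TYPES))
    for course, grade in semester_grades.items():
        position = course_index[course] * len(GRADE_TYPES) + GRADE_TYPES.index(grade)
        vec[position] = 1
    return vec


course_index = {"c1": 0, "c2": 1, "c3": 2}                   # hypothetical course ids
x = grade_multi_hot({"c1": "A", "c3": "Pass"}, course_index)
```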


In addition, the student-course AMHEN model can also
predict the grades of students by calculating the cosine
similarities between student embeddings and course embeddings,
and then, for each course, picking the grade type whose course
embedding is most similar to the target student's embedding.

$$g(s_i, c_j) = \arg\max_{r} \cos(s_i, c_{jr}) \tag{17}$$
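
Equation (17) amounts to an argmax over grade types of the cosine similarity between a student embedding and the grade-specific embeddings of a course. A toy sketch, with made-up two-dimensional embeddings and three grade types:

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_grade(student_vec: Vector, course_vecs_by_grade: Dict[str, Vector]) -> str:
    """Return the grade type r whose course embedding c_jr is closest to s_i."""
    return max(course_vecs_by_grade,
               key=lambda r: cosine(student_vec, course_vecs_by_grade[r]))


s_i = [0.9, 0.1]                                            # toy student embedding
c_j = {"A": [1.0, 0.0], "B": [0.5, 0.5], "C": [0.0, 1.0]}   # toy per-grade embeddings
predicted = predict_grade(s_i, c_j)
```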
Figure 4: Illustration of the LSTM-based grade prediction


For the model without grade cut-off, there are seven grade 
types in the student-course AMHEN model representing A, 
B, C, D, F, Pass, and No-pass. A prediction is considered 
correct only if it is exactly the grade a student received in 
the data. For the models with grade cut-off (A or B), we 
group the letter grades not lower than the cut-off as a grade 
type, and the letter grades lower than the cut-off as another 
grade type in the student-course AMHEN model. 


Both the enrollment prediction and grade prediction models 
were trained using a temporal train/test split, with Fall 2008 
through Fall 2015 semesters serving as the training set and 
Spring 2016 as the testing semester. 


5.4 Prerequisite prediction 

Prerequisite course information is essential to encourage or 
mandate that students have the necessary foundational ex- 
perience to be able to learn and succeed in the advanced 
stages of their degree. We used a set of 2,300 prerequi- 
site course pairs, provided by the UC Berkeley Office of the 
Registrar, which contains 1,215 target courses, as a source 
of ground truth to test whether the grade prediction model 
encodes such prerequisite relationships between courses. 


Prerequisite relationships between courses can be inferred
by running inference on an LSTM-based grade prediction model, as
described in [21] and illustrated in Figure 5. Note that, for
this evaluation, only one time slice input of the LSTM trained
for binary-grade (A or lower than A) prediction is needed. We
iterate over all the courses, building an input with a single 1
in the ‘A’ position for that course, and feed the input, which
is a concatenation of a target course and grade A of the
input course, to the LSTM. Over the iterations, the ten input
courses that yield the highest probability at the ‘A’ position
of the target course are selected as
candidate prerequisite courses for the target course. This
approach is similar to the prerequisite skill inference conducted
with DKT [35], but with a much larger vocabulary and with 
ground truth prerequisite structure to validate against. As 
with the other tasks, we also evaluate replacing the input 
of this model with representations from the student-course 
AMHEN graph-embedding approach. 
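
The probing loop described above can be sketched as follows. The trained LSTM is replaced by a hypothetical callable `prob_of_a(input_course, target_course)` that returns the predicted probability of an A in the target course. The lookup table standing in for the model is made up for the example.

```python
def candidate_prerequisites(target, courses, prob_of_a, k=10):
    """Rank input courses by the predicted probability of an A in the target course."""
    scores = {c: prob_of_a(c, target) for c in courses if c != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Stand-in for the trained grade prediction model (hypothetical values).
toy_probabilities = {
    ("Calc1", "Calc2"): 0.81,
    ("Stats", "Calc2"): 0.55,
    ("Art", "Calc2"): 0.40,
}

def toy_model(input_course, target_course):
    return toy_probabilities[(input_course, target_course)]

courses = ["Calc1", "Calc2", "Stats", "Art"]
top2 = candidate_prerequisites("Calc2", courses, toy_model, k=2)
```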


A simple multinomial logistic regression can alternatively be
used to predict prerequisite courses from any arbitrary vector
representation of a course. During training, the input of the
regression is the vector representation of the target course,
and the output is a multi-hot vector of the prerequisite
courses for the target course. During testing, the output is a
probability distribution across all courses, from which the
most probable courses are taken as the prerequisite predictions
of the regression.
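
The input and output encoding of this regression can be illustrated with a small sketch. It covers only the construction of multi-hot targets and the selection of the most probable courses, not the fitting of the regression itself. All names and the toy probability vector are assumptions of the example.

```python
def multi_hot_targets(pairs, courses):
    """Build one multi-hot prerequisite vector per target course.

    pairs: iterable of (prerequisite, target) course identifiers.
    """
    index = {c: i for i, c in enumerate(courses)}
    targets = {}
    for prereq, target in pairs:
        row = targets.setdefault(target, [0] * len(courses))
        row[index[prereq]] = 1
    return targets

def top_k(probabilities, courses, k):
    """Return the k most probable courses from a distribution over all courses."""
    ranked = sorted(range(len(courses)), key=lambda i: probabilities[i], reverse=True)
    return [courses[i] for i in ranked[:k]]

courses = ["C1", "C2", "C3", "C4"]
pairs = [("C1", "C3"), ("C2", "C3"), ("C3", "C4")]
targets = multi_hot_targets(pairs, courses)

# Hypothetical output distribution of the regression for target course C3.
predicted = top_k([0.5, 0.3, 0.05, 0.15], courses, k=2)
```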


We classified all the models for the prerequisite course
prediction task into two types, supervised and unsupervised,
based on whether the model was learned using the official
prerequisite course pairs. For the supervised models (i.e.,
those using the regression), we applied 10-fold
cross-validation to the 2,300 prerequisite course pairs. For
the unsupervised models (i.e., the LSTM-based inferences
described above), the LSTM, with either standard course
multi-hots or graph-based embeddings as input, was first
trained on the supervised task of predicting course grades and
was then queried in an unsupervised manner (i.e., without using
any prerequisite ground truth) to predict course prerequisites.


Figure 5: Prerequisite course prediction using the LSTM-based
grade prediction model [21]


5.5 Average Enrollment Prediction 

Do the representations of courses created by various mod- 
eling techniques encode course popularity information? To 
answer this we test the course representations’ ability to 
predict the average enrollment size of each course. The data 
and models that perform well in this test may be indicative 
of the data and modeling paradigms that would work well 
for temporal versions of this model that could anticipate in- 
creases in course demand and allow institutions to better 
plan room and teaching staff allocations. 


In order to check whether the different types of course
embeddings encode information predictive of the number of
enrollments, we use a simple multi-layer perceptron to predict
average enrollment per course using the different types of
course embeddings introduced in Section 4 as candidate inputs.
RMSE is adopted as the error metric.
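
For reference, the error metric can be computed as in the sketch below. The enrollment figures are made-up values used only to show the calculation.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical average enrollments for two courses.
error = rmse([100.0, 50.0], [90.0, 60.0])
```

Because the errors are squared before averaging, a few courses with very large enrollments can dominate this metric.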


6. EXPERIMENT RESULTS 


We begin this section by reporting a summary of only the best
performing model and data source pairs used to construct the
input representations for each of our five downstream
prediction tasks. This summarized set of best results is shown
in Table 3. On the task of course similarity, a simple
bag-of-words representation of the course catalog description
performs best in terms of median rank and Recall@10 on our
credit-equivalency pairs validation set. Enrollment histories
provide the second best performing score using the sc-AMHEN
network-based embedding, followed by multi-c2v. Scoring
similarly to multi-c2v was a simple BOW of the lms-syllabus
data. On the task of predicting which courses a student will
take next (enrollment prediction), an LSTM with a multi-hot
input representation of courses taken in each semester provided
the best performance in terms of both metrics. In this task,
using pre-trained embeddings from the network-based or
multi-c2v approach worked less well than multi-hot, followed by
using the content-based representations as inputs, which
performed worst. In grade prediction, the network-based method
performed slightly better than the previous state-of-the-art
LSTM. On the task of prerequisite prediction, the network-based
approach performed best in recovering the ground-truth
prerequisite relationships found in our institutional data. The
multi-c2v approach was not far behind. The content-based and
LSTM course representations did not perform nearly as well on
this task. Finally, on the task of predicting the average
enrollment of a course, multi-c2v provided the lowest RMSE, but
with an almost identical score achieved by a simple BOW of the
course catalog description.


Table 3: Evaluation of course representation models on various prediction tasks

| Model | Data source(s) | Course similarity: Mean/Median Rank | Course similarity: Recall@10 | Enrollment prediction: Recall@10 | Enrollment prediction: MRR@10 | Grade prediction: Accuracy | Prerequisite prediction: Pairs (Recall@10) | Prerequisite prediction: Target | Avg. enrollment prediction: RMSE |
|---|---|---|---|---|---|---|---|---|---|
| bag-of-words | catalog | 602/6 [83] | 0.5370 [83] | 0.3154 | 0.5216 | - | 0.5152 | 0.5938 | 42.4781 |
| bag-of-words | syllabus | 329/19 | 0.4270 | 0.3744 | 0.5103 | - | 0.5658 | 0.6352 | 48.8965 |
| multi-c2v | enrollments, course meta-information | 224/15 [83] | 0.4485 [83] | 0.3791 | 0.5576 | - | 0.6957 | 0.7733 | 42.4780 |
| LSTM (multi-hot) | enrollments | 584/58 | 0.2924 | 0.3967 | 0.5885 | 0.6952 | 0.3048 [21] | 0.4486 [21] | 51.4140 |
| sc-AMHEN | enrollments, grades, catalog | 288/11 | 0.4767 | 0.3882 | 0.5625 | 0.7008 | 0.7192 | 0.8000 | 52.3370 |


Table 4: Course similarity validation of all the course representations

| Model | Mean/Median Rank | Recall@10 |
|---|---|---|
| catalog | 602/6 | 0.5372 |
| syllabus | 329/19 | 0.4270 |
| course2vec (c2v) | 244/21 | 0.3839 |
| multi-c2v (mc2v) | 224/15 | 0.4485 |
| … | … | … |
| catalog+syllabus+mc2v | 79/3 | 0.6705 |
| catalog+syllabus+mc2v (PCA dim: 300) | 177/3 | … |
| LSTM | 584/58 | 0.2924 |
| sc-AMHEN(u) | 288/11 | 0.4767 |
| sc-AMHEN(c) | 330/27 | 0.3603 |


In the subsequent sections we provide a more detailed break- 
down of performance of all model and data combinations on 
the tasks of course similarity, grade prediction, and prerequi- 
site prediction. Results of enrollment prediction and average 
enrollment prediction are already shown in full in Table 3. 


6.1 Course Similarity 

The evaluation results on the equivalency validation set of
1,351 course credit-equivalency pairs are shown in Table 4. The
bag-of-words representations (TF-IDF) generated from course
catalog descriptions achieved better median rank and recall@10
than those generated from the course syllabus data. However,
the mean rank of the catalog-based representations is the worst
among all the models, which suggests there are many outliers
where literal semantic similarity (bag-of-words) is very poor
at identifying equivalent pairs. Concatenating the bag-of-words
based representations with the course2vec-based representations
improved the evaluation metrics, especially when the
bag-of-words representations of catalog and syllabus were
combined with the multi-factor course2vec representations,
reaching a mean/median rank of 79/3 and recall@10 of 0.6705,
the best among all the models. A Principal Component Analysis
(PCA) transformation of the concatenated course vectors from
10,000 to 300 dimensions did not diminish the median rank
metric, but slightly negatively affected average rank and
recall. The course representations learned from the next-course
prediction LSTM performed the worst among all the models.
Course attribute embeddings sourced from the student-course
AMHEN (sc-AMHEN) model performed second best among all single
representation models.
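
The rank-based metrics reported for this task can be sketched as follows, assuming that for each equivalency pair all other courses are ranked by cosine similarity to the first course of the pair. The ranking procedure, function names, and toy vectors are assumptions of this example, not a description of the exact evaluation code.

```python
import math
from statistics import mean, median

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_of(query, match, vectors):
    """1-based rank of `match` among all other courses, ordered by similarity to `query`."""
    sims = {c: cosine(vectors[query], v) for c, v in vectors.items() if c != query}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(match) + 1

def evaluate(pairs, vectors, k=10):
    """Return (mean rank, median rank, recall@k) over the validation pairs."""
    ranks = [rank_of(a, b, vectors) for a, b in pairs]
    return mean(ranks), median(ranks), sum(r <= k for r in ranks) / len(ranks)

# Hypothetical course vectors and equivalency pairs.
vectors = {"X": [1.0, 0.0], "Y": [0.9, 0.1], "Z": [0.0, 1.0], "W": [0.5, 0.5]}
pairs = [("X", "Y"), ("X", "Z")]
mean_rank, median_rank, recall_at_1 = evaluate(pairs, vectors, k=1)
```

A single poorly ranked pair moves the mean rank far more than the median rank, which is consistent with the catalog representation having the best median rank but the worst mean rank.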


6.2 Grade Prediction 

The accuracy of the grade predictions generated by the pure
student-course AMHEN model (sc-AMHEN(s, c)), the LSTM model
with multi-hot input (LSTM(multi-hot)), and the LSTM model with
course embeddings of different grade types as input
(LSTM(u, c)) is listed in Table 5. Among the three models, the
pure student-course AMHEN model is a static model learned from
students' enrollment data with grades and course catalog
descriptions, while the two LSTM-based models are dynamic
models that take into consideration not only the student
enrollment data with grades, but also the sequential
information (semester order) of the grades of enrolled courses.
The grade prediction results show that the graph model, though
static, could map the knowledge levels of students onto the
features of courses with different grade types to a certain
degree, resulting in prediction accuracies higher than 0.5 for
all grade types and higher than 0.6 and 0.8 for binary grades
("not lower than cut-off" vs. "lower than cut-off", Pass vs.
No-pass) on average. Furthermore, the sequential information of
students' grades by semester proved substantially important, as
the prediction accuracy of the two LSTM-based models exceeded
that of the static student-course AMHEN model by a significant
margin. Moreover, the course embeddings with different grade
types learned from the student-course AMHEN model increased the
accuracy of grade prediction over multi-hot vectors as the
input of the LSTM. A potential reason is that the course
embeddings with different grade types captured the knowledge
relations among grades of a course and the relations among
different courses, and thus could represent the knowledge of
students more accurately than multi-hot vectors, which cannot
encode any knowledge relations among grades. Although the
positive impact of incorporating grade embeddings on grade
prediction (improvement at the 0.01 level) is not as salient as
the advantage of bringing in sequential information
(improvement at the 0.1 level), it is manifested in all the
evaluations with different grade types.


Table 5: Grade prediction evaluation (accuracy)

| Model | Type | Cut-off | Letter grade | Pass/No-pass | All |
|---|---|---|---|---|---|
| sc-AMHEN (s, c) | static | - | 0.5441 | 0.7972 | 0.5976 |
| LSTM (multi-hot) | dynamic | - | 0.6382 | 0.9079 | 0.6952 |
| LSTM (u, c) | dynamic | - | 0.6418 | 0.9209 | 0.7008 |
| sc-AMHEN (s, c) | static | A | 0.5526 | 0.7791 | 0.6004 |
| LSTM (multi-hot) | dynamic | A | 0.7523 | 0.8581 | 0.7633 |
| LSTM (u, c) | dynamic | A | 0.7571 | 0.9135 | 0.7902 |
| sc-AMHEN (s, c) | static | B | 0.8299 | 0.8205 | 0.8279 |
| LSTM (multi-hot) | dynamic | B | 0.8805 | 0.9178 | 0.8884 |
| LSTM (u, c) | dynamic | B | 0.8817 | 0.9185 | 0.8895 |


6.3 Prerequisite prediction 
The evaluation results of prerequisite course prediction are
shown in Table 6. The supervised models performed dramatically
better in reconstructing the prerequisite pairs. Among all
types of course representations, the course embeddings and
grade embeddings learned from the student-course AMHEN
performed the best, with 71.92% of the prerequisite pairs
correctly predicted and 80% of all target courses having at
least one of their prerequisite courses correctly predicted.
For unsupervised models, we found that the one-hot
representation of courses performed better than course and
grade embeddings in the prerequisite course inference framework
described in Section 5.4.


Table 6: Prerequisite course prediction

| Model | Supervised | Pairs (Recall@10) | Target |
|---|---|---|---|
| LSTM(one-hot) | ✗ | 0.3048 | 0.4486 |
| LSTM(u, c) | ✗ | 0.2423 | 0.3580 |
| catalog | ✓ | 0.5152 | 0.5938 |
| syllabus | ✓ | 0.5658 | 0.6352 |
| mc2v | ✓ | 0.6957 | 0.7733 |
| sc-AMHEN(u, c) | ✓ | 0.7192 | 0.8000 |
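
The two prerequisite metrics, the share of ground-truth pairs recovered and the share of target courses with at least one prerequisite recovered, can be computed as in the following sketch. The data structures and names are hypothetical, and the prediction lists are assumed to hold the top ten candidates per target course.

```python
def prerequisite_metrics(predictions, truth):
    """Return (pair recall, target coverage).

    predictions: target course -> list of predicted prerequisite courses
    truth: target course -> set of ground-truth prerequisite courses
    """
    total_pairs = sum(len(prereqs) for prereqs in truth.values())
    hit_pairs = 0
    hit_targets = 0
    for target, prereqs in truth.items():
        hits = prereqs & set(predictions.get(target, []))
        hit_pairs += len(hits)
        if hits:
            hit_targets += 1
    return hit_pairs / total_pairs, hit_targets / len(truth)

# Hypothetical ground truth and predictions.
truth = {"C3": {"C1", "C2"}, "C4": {"C3"}}
predictions = {"C3": ["C1", "C9"], "C4": ["C8"]}
pair_recall, target_coverage = prerequisite_metrics(predictions, truth)
```

The target-level figure is never lower than the pair-level figure when it is computed this way, which matches the pattern in Table 6.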


7. CONCLUSIONS 


In this paper, we evaluated the utility of two content sources
of data about courses, catalog descriptions and syllabi, as
well as enrollment histories and grades. We paired these
sources with four different representations produced by simple
bag-of-words, multifactor Course2vec, LSTM, and network-based
embedding. We compared the performance of these pairings on
five prediction tasks: course similarity, enrollment
prediction, grade prediction, prerequisite prediction, and
average enrollment prediction.


On the topic of the utility of syllabus data, which has not 
been evaluated before, we found that it showed benefit over 
catalog description data only in inferring prerequisite rela- 
tionships (Recall of 0.5658 vs 0.5152), perhaps due to syllabi 
being the finer-grained source of content information about a 
course. In terms of course similarity signal, catalog descrip- 
tion was markedly better than syllabus (Recall of 0.5372 vs 
0.427) and our results indicate that catalog description, syl- 
labus, and enrollment histories all bring some level of com- 
plementary information as the combination of all three per- 
formed better than any one or two combined. Enrollment 
data was used in the best scoring model in four of the five 
tasks, with only the best performing course similarity task 
model not utilizing enrollments. The nascent network-based 
approach performed well on all tasks, and was the top model 
in grade prediction and prerequisite prediction. 


To conclude: (1) syllabus data is worth the effort to col- 
lect compared to catalog description for prerequisite predic- 
tion and (2) complements the catalog description and enroll- 
ment data on the course similarity task, (3) for prerequisite 
learning, supervised approaches based on embeddings per- 
form much better than inferencing a pre-trained assessment 
model, (4) multifactor Course2vec often performs close to 
the more complex network-based approach on all tasks and 
(5) seeding the LSTM with course representations from the 
other models did not improve next-course prediction per- 
formance, while seeding with course grade representations 
from the student-course AMHEN model provided a small 
improvement in the grade prediction task. 


8. LIMITATIONS AND FUTURE WORK 


Our analyses were limited to data from a single large pub- 
lic institution in the US. Future work will need to evaluate 
multiple institutions of varying sizes, student demographics, 
and course taking policies in order to examine the generaliz- 
ability of these approaches. In terms of models, we focused 
on simple text-based approaches and more complex neural 
models, both well established and nascent. Classical models 
of intermediary complexity were not evaluated. 


We included tasks that have been common in EDM papers 
involving enrollment data; however, other institutional tasks 
exist that could be evaluated to produce an even more com- 
prehensive analysis. These tasks include course preparation 
recommendation [21, 20], degree or course attrition
prediction, and future course demand forecasting.


Syllabi in their original form could be evaluated, instead 
of in bag-of-words form, in order to investigate if the posi- 
tionality of words in the syllabi offered any additional pre- 
dictive utility. Lastly, learning management system click- 
stream data, as well as content information in addition to 
the syllabus, could be leveraged to enhance both content- 
based and collaborative-based course representations. This 
combination of different modalities and scales of data is an 
identified open challenge for the field [14]. 